Predicting Speed Dating Matches

Author: Anil Kumar

  1. Project Motivation

  2. Dataset

  3. Analysis Plan

  4. Preprocessing and Exploratory Data Analysis

  5. Modeling and Results

  6. Conclusions and Further Directions

  7. References

1. Project Motivation

Speed Dating is an organized event where participants meet multiple potential suitors in a relatively short period of time. Participants meet with each other on short "dates" that usually last 3-8 minutes. At the end of each round, participants rotate to another date, and at the end of the session participants submit a list of people that they are interested in seeing again. If a pair of participants both agree to meet up outside the scope of the dating service, they are considered a match and their contact information is given to each other after a few days. [1]

For speed dating services, it could be beneficial to be able to predict matches based on participant survey responses about themselves, their preferences, and opinions about a potential suitor. Dating services could use the data to gain insights into what factors really motivate whether two people will match. With this information, services could be further personalized to clients in order to give them the best experience and more importantly the best opportunities to find matches.

2. Dataset

The dataset was obtained from DataCamp's careerhub-data repository on GitHub; the original files and data dictionary can be viewed there as well. The specific speed dating service the data came from was not disclosed. The data consists of participant survey responses from speed dating encounters, along with whether each encounter ended in a match (the target attribute). Each observation consists of participant and partner responses to questions regarding interests, preferences, and opinions about the other person from a speed dating encounter. Demographic features such as race, gender, and age are also included. For this dataset, it appears that each participant meets with 20 potential partners, as deduced from the data dictionary.

In summary, the data has 61 features and a binary target (NOTE: Not all the original features listed in the data dictionary are eventually used for modeling and this topic is discussed further in the Preprocessing and Exploratory Data Analysis section). A summary of the attributes is below. For the numeric features, some have different scales, and these are indicated below. Higher ratings represent more positive or stronger opinions. An incidental finding is that there are spelling errors in three columns (sinsere_o should be sincere_o; ambitous_o should be ambitious_o; intellicence_important should be intelligence_important). However, these spelling errors are irrelevant with respect to creation of models or their interpretation, and are merely observational.

In the summary below, "person" refers to a participant, and "partner" refers to their potential suitor from a speed dating encounter.


TARGET (0=Non-Match, 1=Match):


BINARY FEATURES (0=No, 1=Yes):


CATEGORICAL FEATURES:


NUMERIC FEATURES ON 0-10 SCALE:


NUMERIC FEATURES ON 0-100 SCALE:


OTHER NUMERIC FEATURES:

3. Analysis Plan

This is a supervised learning binary classification problem. Therefore, models appropriate for binary classification are implemented. Specifically, three types of modeling techniques are implemented and compared:

  1. Logistic Regression
  2. Random Forest
  3. Extreme Gradient Boosting

A Logistic Regression model is a good baseline for a binary classification problem, as its complexity and computational cost are minimal compared to ensemble tree-based methods or other more complex models. For the Random Forest and Extreme Gradient Boosting models, hyperparameter tuning through randomized grid searches is utilized. This is an imbalanced classification problem, since there are approximately five times as many non-matches as matches. For this reason, area under the ROC curve (AUC) and Log Loss are used as the evaluation metrics. [2] Additionally, a confusion matrix is utilized to calculate accuracy, precision, recall, specificity, and F1 score for the model with the best performance.

4. Preprocessing and Exploratory Data Analysis

Initial inspection and preprocessing of the data is performed to assess and deal with any data quality issues that need to be resolved prior to further exploratory data analysis (EDA) and modeling.

4.1 Initial Inspection and Preprocessing

NOTE: If viewing the .pdf version of this notebook, three of the tables in this section are too wide to be viewed in their entirety due to the high dimensionality of the dataset. These include a table showing the first ten rows of the dataset and two tables showing summary statistics of the numeric features. Please reference the .ipynb notebook file for full views and the ability to side-scroll through these tables.

The necessary modules are imported and the .csv file is read in. The dataset is then inspected for missing values.

It appears that there are no missing values in any columns. However, this may not be the case, as missing values could be coded as something other than NaN's. So, the value counts for all of the columns are inspected, and it is clear from the summary below that a missing value is represented by a '?' in the dataset.

The .csv file is read in again. However, this time '?'s are converted to NaN's.
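This re-read can be done in one step with pandas' na_values parameter, which converts the sentinel '?' to NaN at parse time. A minimal sketch using an in-memory stand-in for the real file (the column names are illustrative):

```python
import io
import pandas as pd

# Hypothetical miniature of the file; '?' marks a missing value.
raw = io.StringIO("age,attractive_partner,match\n27,6,0\n?,8,1\n31,?,0\n")

# na_values='?' converts every '?' to NaN at read time, so numeric
# columns parse as floats instead of object (string) columns.
df = pd.read_csv(raw, na_values="?")
print(df.isna().sum().sum())  # total count of missing entries
```

A side benefit is that `df.isna()` now reports missing values directly, so the earlier value-counts workaround is no longer needed.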


Upon initial inspection of the dataset, it appears that there are many more columns (features) than listed in the data dictionary. The dataset has a total of 8,378 observations (rows) and 123 columns including the target column, match. After further inspection, it appears that for almost all of the numeric features, there is also a corresponding categorical feature.

For example, the feature attractive_partner, a numeric feature representing a participant's rating of their partner's attractiveness on a scale from 0-10, is collapsed into the categorical feature d_attractive_partner, which bins the ratings into 3 categories: [0-5], [6-8], and [9-10]. The exception to this rule is d_age, which represents the age difference between a participant and partner; however, there is a column d_d_age that discretizes d_age.

There are advantages and disadvantages to discretizing numeric (continuous) variables. A main advantage is that discrete variables are easier to interpret and can dampen the influence of outliers. A main disadvantage is the potential loss of information. Because information could be lost, the original numeric features are used here. However, running a model on the discretized features and comparing its performance to the numeric-feature model would be a useful future exercise.
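The dataset's d_* binning can be reproduced with pandas' pd.cut. A small sketch of the [0-5], [6-8], [9-10] scheme described above, on made-up ratings:

```python
import pandas as pd

ratings = pd.Series([3, 5, 6, 8, 9, 10])

# Bin edges are exclusive on the left, so [-1, 5, 8, 10] yields the
# intervals (−1, 5], (5, 8], (8, 10], matching the d_* categories.
binned = pd.cut(ratings, bins=[-1, 5, 8, 10], labels=["[0-5]", "[6-8]", "[9-10]"])
```

This makes the information loss concrete: a 6 and an 8 become indistinguishable after binning.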

In addition to dropping the discretized features, the following features were also dropped: wave, decision, and decision_o. It is assumed that wave does not represent data relevant to predicting the target since it is simply the group number of a participant. Using decision and decision_o would cause target leakage as they are the decisions of the participant and partner on each other (0=does not want to meet again, 1=wants to meet again). For example, if both decision and decision_o are equal to 1, that constitutes a match; otherwise, it is a non-match. The column has_null contains a binary flag for each observation that indicates whether there is missing data for that observation. This feature is kept, as it is possible that knowing whether a row has missing values could be beneficial to a model when making a prediction. [3]
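A sketch of this column-dropping step on a toy DataFrame. The keep-list logic assumes d_age is the only 'd_'-prefixed column to retain (per the exception noted above):

```python
import pandas as pd

# Toy stand-in with a few of the real column names.
df = pd.DataFrame({
    "attractive_partner": [7, 4],
    "d_attractive_partner": ["[6-8]", "[0-5]"],
    "d_age": [3, 10],
    "wave": [1, 1],
    "decision": [1, 0],     # target leakage: participant's decision
    "decision_o": [1, 1],   # target leakage: partner's decision
    "has_null": [0, 1],
    "match": [1, 0],
})

# Drop every discretized 'd_*' column (except d_age, which is the
# numeric age difference), plus wave and the two leakage columns.
to_drop = [c for c in df.columns if c.startswith("d_") and c != "d_age"]
to_drop += ["wave", "decision", "decision_o"]
df = df.drop(columns=to_drop)
```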

The dataset is reinspected after dropping these columns. Missing values are more easily detectable as NaN's since the '?'s were converted to NaN's. This is seen in the summary tables below.

As previously stated in the Dataset section, there are spelling errors in 3 columns: sinsere_o (should be sincere_o), ambitous_o (should be ambitious_o), and intellicence_important (should be intelligence_important). However, these errors are irrelevant with respect to creation of models or their interpretation, and are merely observational.

To provide further aid in checking for any data quality issues, an inspection of the value counts and number of unique values for every column as well as a table of summary statistics of the numeric features are below.

A few observations related to data quality:

- met should be either 0 or 1, as it represents whether the participant and partner have previously met. However, there are 8 observations where the value is greater than 1; these will be replaced with 0.
- field has 260 distinct values. This feature will be dropped, as its predictive power would be limited given such high cardinality.
- attractive_o has 1 value of 10.5 (the scale is 0-10; this entry will be changed to 10).
- funny_o has 1 value of 11.0 (the scale is 0-10; this entry will be changed to 10).
- gaming has 78 values of 14.0 (the scale is 0-10; these entries will be changed to 10).
- reading has 51 values of 13.0 (the scale is 0-10; these entries will be changed to 10).
- The features ending in '_important' or beginning with 'pref_' are on a 0-100 scale as opposed to a 0-10 scale (12 features in total).

For the features on 0-10 scales above, the assumption is made that an invalid rating higher than 10 actually represents a 10, likely due to a data-entry issue. This assumption is a limitation, in the sense that the invalid entries could instead be replaced with NaN's and later imputed.

For met, values greater than 1 are assumed to be errors in data entry as well. Potentially, 0-10 scale ratings for another column could have been inadvertently placed in the met column. Since there are far more participants that did not previously meet (7,644) than did (351), and there are only 8 invalid entries, these invalid entries are replaced with 0's (assumes no previous meeting).
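The two fixes above can be expressed compactly with pandas' clip and where. A sketch on made-up rows containing the kinds of invalid entries described:

```python
import pandas as pd

df = pd.DataFrame({
    "attractive_o": [10.5, 7.0],
    "funny_o": [11.0, 6.0],
    "gaming": [14.0, 3.0],
    "reading": [13.0, 8.0],
    "met": [5.0, 1.0],
})

# Cap out-of-range 0-10 ratings at 10 (assumed data-entry errors).
for col in ["attractive_o", "funny_o", "gaming", "reading"]:
    df[col] = df[col].clip(upper=10)

# Any met value other than 0 or 1 is assumed to mean "not previously met".
df["met"] = df["met"].where(df["met"].isin([0, 1]), 0)
```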


The dataset is reinspected for the count and percentage of missing values in each column.

The feature expected_num_interested_in_me is missing in 78.5% of rows (6,578 rows). Since there is so much missing data in this column, it is dropped.

4.2 Visual EDA

A visual exploration of the data can now be completed.

It is seen below that out of the 8,378 observations, 6,998 (83.53%) are non-matches and 1,380 (16.47%) are matches. Since the target is imbalanced, a predictive model will be biased towards the more prevalent class (0 = non-match). As stated in the Analysis Plan section, this is the reason AUC and Log Loss are used as the evaluation metrics.

Next, kernel density estimate (KDE) plots are created for all of the numeric features to explore their distributions and to see if they differ between matched and non-matched participants. A KDE plot is essentially a "smoothed" histogram. Additionally, boxplots are also utilized to further investigate the distribution of the numeric features.
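A minimal sketch of such a KDE/boxplot comparison, using synthetic 0-10 ratings rather than the real dataset, and a hand-rolled Gaussian KDE so the example is self-contained:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for one rating column, split by match outcome.
non_match = rng.normal(6.0, 2.0, 500).clip(0, 10)
match = rng.normal(7.5, 1.5, 100).clip(0, 10)

def kde(sample, grid, bandwidth=0.5):
    """Gaussian kernel density estimate of `sample` evaluated on `grid`."""
    diffs = (grid[:, None] - sample[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)
    return weights.sum(axis=1) / (len(sample) * bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(0, 10, 200)
fig, (ax_kde, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
ax_kde.plot(grid, kde(non_match, grid), label="non-match")
ax_kde.plot(grid, kde(match, grid), label="match")
ax_kde.legend()
ax_box.boxplot([non_match, match])
ax_box.set_xticklabels(["non-match", "match"])
```

In the notebook itself, seaborn's kdeplot and boxplot serve the same purpose with less code.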

For the vast majority of the numeric features, the distributions are similar for matched and non-matched participants. Some noteworthy findings where the distributions and boxplots are noticeably different:

attractive_o and attractive_partner: higher values seen in matched participants
funny_o and funny_partner: higher values seen in matched participants
shared_interests_o and shared_interests_partner: higher values seen in matched participants
intelligence_o and intelligence_partner: higher values seen in matched participants
sinsere_o and sincere_partner: higher values seen in matched participants
like: higher values seen in matched participants
guess_prob_liked: higher values seen in matched participants

Intuitively, the above findings make sense, as people that find each other attractive, funny, and sincere, share similar interests, and like each other are more likely to match. Please refer back to the Dataset section for the detailed description of these features.

Next, the binary features met, samerace, and has_null and their relationships to the target, match, are explored. This is accomplished through 2x2 contingency tables.

For met: ~10.4% of couples that matched had previously met, compared to ~3.2% of couples that didn't match
For samerace: ~41.0% of matched couples are the same race, compared to ~39.2% of non-matched couples
For has_null: ~86.5% of matched couples have at least one missing value in a column, compared to ~87.7% of non-matched couples
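Percentages of this kind come from a column-normalized crosstab. A sketch on a handful of made-up rows:

```python
import pandas as pd

# Toy data: 8 encounters with their met flag and match outcome.
df = pd.DataFrame({
    "met":   [1, 0, 0, 1, 0, 0, 0, 1],
    "match": [1, 1, 0, 0, 0, 1, 0, 1],
})

# 2x2 contingency table; normalize='columns' gives, within each match
# outcome, the share of couples that had previously met.
table = pd.crosstab(df["met"], df["match"], normalize="columns")
```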

Based on these findings, it looks like previously meeting a potential partner has a bigger effect on matching than being the same race or having a missing entry in the survey.

Finally, the categorical features gender, race, and race_o and their relationships to the target, match, are explored. This is accomplished through count plots and contingency tables.

From the above countplots and contingency tables, it appears that the distribution of different levels of the categorical variables is fairly similar between matched and non-matched participants.

Based on the above visualizations and tables, it appears that these 13 features have the most significant differences in distributions between matched and non-matched participants:

Before moving onto modeling, the dataset is reexamined to deal with missing data.

For met, it is assumed that a missing entry means the participant has not previously met the partner, so NaN's are replaced with 0's.

For race and race_o, NaN's will be replaced with the word 'Unknown'.

For the numeric features with missing values, mean imputation is implemented. From the table below, it is seen that the mean and median are similar for all of the numeric features. If outliers were significantly affecting the mean, then median imputation might be more appropriate. As an aside, though mean or median imputation can bias the data and predictive model by underestimating the variance in the data, the percentage of missing values is mostly small (except for shared_interests_o and shared_interests_partner, both missing ~13%, and expected_num_matches, missing ~14%), so the assumption is made that imputation should not significantly influence the model. However, this is a limitation. The assumption is also made that the missing values are not MNAR (Missing Not at Random), as mean or median imputation would not be appropriate in that case; it is more appropriate when data is MAR (Missing at Random) or MCAR (Missing Completely at Random). [4]

Mean imputation for missing values in the numeric columns is completed below. Technically, samerace, has_null, and met are represented as numeric values (0 or 1), but the mean imputation code does not affect these columns as at this point in the process none of these columns have missing values. If they did, imputation would need to be taken care of separately, as was completed for met.
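The imputation steps above can be sketched as follows, on a toy DataFrame (column names illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "like": [7.0, np.nan, 9.0, 6.0],
    "race": ["Asian", None, "White", "Black"],
    "met": [0, 1, 0, 0],
})

# Categorical NaNs -> 'Unknown'.
df["race"] = df["race"].fillna("Unknown")

# Numeric NaNs -> column mean. Binary columns like met pass through
# unchanged here because they have no missing values at this point.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
```

In a production pipeline, sklearn's SimpleImputer fit on the training split only would avoid leaking validation-set statistics into the imputed values.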

There are no more missing values in any of the columns as shown below. Modeling can now be implemented.

5. Modeling and Results

This is a binary classification problem, so appropriate modeling techniques are implemented. The following methods are compared: Logistic Regression, Random Forest, and Extreme Gradient Boosting (XGBoost).

Prior to creating models, categorical features are label encoded as numeric so they can be taken as input by the models. For ensemble tree-based models, one-hot encoding is not necessary and can even be detrimental to model performance and efficiency. [5] One-hot encoding is beneficial for a logistic regression model, as label encoded categorical features should not be interpreted as having any particular order. Therefore, one-hot encoded categorical features are used for the logistic regression model, and label encoded categorical features for the tree-based models.
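Both encodings are one-liners in pandas. A sketch on a toy race column:

```python
import pandas as pd

df = pd.DataFrame({"race": ["Asian", "White", "Black", "White"]})

# Label encoding for the tree-based models: arbitrary integer codes,
# with no order intended or implied.
df["race_label"] = df["race"].astype("category").cat.codes

# One-hot encoding for logistic regression: one indicator column per
# level, so no spurious ordering is introduced.
onehot = pd.get_dummies(df[["race"]], columns=["race"])
```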

The dataset is split into training and validation sets (80% training, 20% validation). The split stratifies on the target so that the proportions of the target classes are equal in the training and validation sets, ensuring that the data remains equally imbalanced in both sets. Altering this imbalance could lead to the model making biased predictions on the validation set and on new data.
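A sketch of the stratified split with sklearn, on synthetic data mimicking the roughly 5:1 class imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 84 + [1] * 16)  # ~5:1 non-match:match ratio

# stratify=y keeps the class proportions (nearly) identical in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```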

5.1 Logistic Regression

A Logistic Regression classifier is created and fit to the one-hot encoded training data. The Log Loss, AUC, and mean AUC and mean Log Loss from 5-fold cross validation are calculated for the training set. Next, the AUC and Log Loss are calculated for the validation set.
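A minimal sketch of this fit-and-evaluate sequence, using a synthetic imbalanced dataset in place of the real one:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: ~84% negative class, like the real target.
X, y = make_classification(n_samples=400, weights=[0.84], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Training-set metrics from predicted probabilities.
proba = clf.predict_proba(X)[:, 1]
train_auc = roc_auc_score(y, proba)
train_ll = log_loss(y, proba)

# Mean AUC from 5-fold cross validation (cross_val_score clones clf).
cv_auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```

The same predict_proba / roc_auc_score / log_loss pattern applies unchanged to the validation set.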

5.2 Random Forest

For the Random Forest, a randomized grid search is utilized to search for the hyperparameters that create the model with the best performance, with AUC used as the scoring metric. In this case, there are a total of 25 possible hyperparameter combinations given that max_depth has 5 possibilities and n_estimators has 5 possibilities in the code block below (5x5=25). At random, 15 of the 25 possible models are tested, and 5-fold cross validation is performed on each of these candidate models. For this particular case, a complete grid search of all 25 possible combinations would not be too much more costly with respect to computational time.

A summary of the chosen hyperparameters to tune is below:

From the search, it appears that {'n_estimators': 500, 'max_depth': 6} are the hyperparameters selected in the best model. This tuned Random Forest classifier (best_model_rf) is extracted from the search and then fit to the training data. The Log Loss, AUC, and mean AUC and mean Log Loss from 5-fold cross validation are calculated for the training set. After, the AUC and Log Loss are calculated for the validation set.
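A scaled-down sketch of the randomized search (smaller grid, fewer iterations, and synthetic data, to keep it fast; the report's grid used 5 values each for max_depth and n_estimators):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Illustrative grid, not the report's exact values.
param_dist = {"max_depth": [2, 4, 6], "n_estimators": [50, 100]}

# Sample 4 of the 6 combinations, scored by cross-validated AUC.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=4, cv=3, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
best_model_rf = search.best_estimator_  # refit on the full training data
```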

5.3 Extreme Gradient Boosting

For the Extreme Gradient Boosting model, a randomized grid search is utilized to find the hyperparameters that yield the best performance, with AUC as the scoring metric. In this case, there are a total of 3,840 possible hyperparameter combinations (5x4x4x4x4x3) given the grid in the code block below. Running all possible combinations would take too long, so 30 random models are tested, and 5-fold cross validation is performed on each of these candidates. Unlike the Random Forest case, a complete grid search of all possible combinations would not be feasible here with respect to computation time.
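The stated 3,840 total can be reproduced by counting a grid of this shape. The specific values below are assumptions, chosen only so that the best parameters reported further down fall inside the grid:

```python
# Assumed grid matching the stated 5x4x4x4x4x3 shape.
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.4, 0.6, 0.8, 1.0],
    "colsample_bytree": [0.4, 0.6, 0.8, 1.0],
    "lambda": [1, 2, 3],
}

# Total combinations = product of the per-parameter value counts.
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)
```

At 30 sampled candidates with 5-fold cross validation, the search fits 150 models rather than the 19,200 a full grid search would require.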

A summary of the chosen hyperparameters is below. [6], [7]

From the search, it appears that {'subsample': 0.4, 'n_estimators': 500, 'max_depth': 5, 'learning_rate': 0.05, 'lambda': 2, 'colsample_bytree': 1.0} are the hyperparameters for the best model. This tuned XGBoost classifier (best_model_xgb) is extracted from the search and then fit to the training data. The Log Loss, AUC, and mean AUC and mean Log Loss from 5-fold cross validation are calculated for the training set. After, the AUC and Log Loss are calculated for the validation set.

The XGBoost model performed better than the Random Forest and Logistic Regression models, as it has the smallest Log Loss and highest AUC. An AUC comparison figure and table comparing the evaluation metrics for all 3 models are below.

A confusion matrix of the XGBoost model is created along with an examination of the model's feature importances. Two functions to aid in these tasks are below. The make_confusion_matrix function was obtained from here.

Looking at the confusion matrix, the model has a much higher specificity (0.953) compared to recall (0.442). In simpler terms, the recall is the ability of the model to correctly identify matches, and the specificity is the ability of the model to correctly identify non-matches. Therefore, the model identifies non-matches much more easily than matches. More specifically, the model predicted 44.2% of the matches correctly and 95.3% of the non-matches correctly. The accuracy is 0.869, meaning that across both matches and non-matches, the model predicts correctly 86.9% of the time. As previously stated, accuracy is not the best metric for datasets with target imbalance. The precision (also known as positive predictive value) is 0.649, meaning that when the model predicts a value of 1 (match), it is correct 64.9% of the time. The F1 score measures the balance between recall and precision, and can be interpreted as a weighted average of the two. [8], [9]

\begin{equation} Accuracy = \frac{TP+TN}{TP+FP+TN+FN} \end{equation}


\begin{equation} Precision = \frac{TP}{TP+FP} \end{equation}
\begin{equation} Recall = \frac{TP}{TP+FN} \end{equation}
\begin{equation} Specificity = \frac{TN}{TN+FP} \end{equation}
\begin{equation} F1\ Score = 2\times\frac{Precision \times Recall}{Precision+Recall} \end{equation}
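The formulas above can be checked directly against the reported rates. The counts below are not taken from the notebook; they are back-solved from the reported metrics assuming a stratified 20% validation split of 1,676 rows (276 matches, 1,400 non-matches), so they are an approximation:

```python
# Back-solved counts (assumption, not the notebook's actual matrix):
# recall 0.442 -> tp ≈ 122 of 276 matches; specificity 0.953 -> tn ≈ 1334 of 1400.
tp, fn = 122, 154
tn, fp = 1334, 66

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
```

Rounded to three decimals, these reproduce the reported accuracy (0.869), precision (0.649), recall (0.442), and specificity (0.953).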

Depending on the goals of the dating service, the probability threshold for predicting a match can also be altered from the default of 0.50. For example, with a threshold of 0.50 (confusion matrix above), there is high specificity and low recall. Suppose the cost of a false positive is relatively low, meaning that the service would rather err on the side of predicting that a couple will match even if they would not. To achieve this, the probability threshold could be lowered, which would increase the recall while sacrificing some specificity (false positives would increase and false negatives would decrease). Take for example the confusion matrix below, where the probability threshold has been lowered to 0.35.

The accuracy remains similar, but the recall has increased from 0.442 to 0.583, while the specificity has decreased from 0.953 to 0.925. Of note, the F1 score increased while the precision decreased. Essentially, lowering the threshold produced a bigger gain in recall than loss in specificity, so this seems like a better probability threshold. Varying thresholds can be tried to find the optimal one, depending on the goals of the dating service.
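Re-thresholding does not require refitting; it only reinterprets the predicted probabilities. A sketch on hypothetical probabilities (the values are made up):

```python
import numpy as np

# Hypothetical predicted match probabilities from a fitted classifier.
proba = np.array([0.10, 0.30, 0.38, 0.45, 0.55, 0.80])

# Default threshold vs. the lowered one: more 1s, hence higher recall
# at the cost of more false positives.
pred_default = (proba >= 0.50).astype(int)
pred_lowered = (proba >= 0.35).astype(int)
```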

Going back to the visual EDA, the following features showed visibly different distributions between matched and non-matched participants:

As seen in the feature importance figure below, many of these features are also the most important predictive features of the XGBoost model.

6. Conclusions and Further Directions

Not surprisingly, the XGBoost model performed the best out of the three classification models, with an AUC of 0.88 and a Log Loss of 0.30 on the validation set. The mean AUC on the training set using 5-fold cross validation was 0.87, so the model does not seem to be overfitting. The features that were expected to be the most important from EDA also corresponded to the most important features of the model. As previously discussed, the model is better at detecting non-matches than matches (specificity was much higher than recall), which is expected given the target imbalance. Depending on what is more costly and/or beneficial from a business standpoint and for customer satisfaction, the probability threshold can be adjusted accordingly when making predictions, and multiple payoff matrices can be analyzed for the best result. For example, decreasing the threshold results in more matches being identified, but some specificity is sacrificed. The goal would be to pick a threshold that balances sensitivity and specificity and yields the greatest payoff for the service from client satisfaction and financial standpoints.

Below are some other potential suggestions for future analysis/modeling:
-The model could be built with the categorical features (features prefixed by 'd_') and then compared to the performance of the model that used numeric features.
-Since XGBoost can handle missing values, the model could be built without imputing missing values and its performance could be compared. The missingness of the data could also be further explored and more complex imputation methods could be implemented depending on the results.
-The number of features in the model could be reduced from 61 to only the most important features. If this lower-dimensional model performed equally well to the current model, it could be used in the future to reduce computational time. Additionally, the dating service could then collect data on only these most impactful features, reducing the time and cost of data collection. This could also reduce model overfitting as well as data quality issues, as shorter surveys could lead to fewer data entry errors and missing values.
-Creation of another model purely based on data from pre-date preferences, and not data obtained post-date. Depending on how well this model performed, it could be used to arrange people into groups that would have higher match success rates.

7. References

  1. Speed Dating
  2. Tour of Evaluation Metrics for Imbalanced Classification
  3. Add Binary Flags for Missing Values for Machine Learning
  4. Data Imputation: Beyond Mean, Median, and Mode
  5. One-Hot Encoding is making your Tree-Based Ensembles worse, here’s why?
  6. Fine Tuning XGBoost model
  7. XGBoost Parameters
  8. Confusion Matrix
  9. Accuracy, Recall, Precision, F-Score & Specificity, which to optimize on?
